How good are global Newton methods? Part 1
A. A. Goldstein* 

ABSTRACT: 1) Relying on a theorem of Nemerovsky and Yuden(1979) a lower bound
is given for the efficiency of global Newton methods over the class $C^1(\mu, \Lambda)$ defined below.
2) The efficiency of Smale’s global Newton method in a simple setting with a non-singular,
Lipschitz-continuous Jacobian is considered. The efficiency is characterized by 2 parameters,
the condition number Q and the smoothness S, defined below. The efficiency is
sensitive to S, but insensitive to Q.

KEYWORDS: Global Newton methods, unconstrained optimization, computational com- 
plexity 

Global Newton methods are considered by some to be methods for minimizing a “strongly”
convex function f defined on a real Hilbert space E. Strongly convex means that f is twice
differentiable with a Hessian that is bounded from above and below. By $C(\mu, \Lambda)$ we denote
the set of all strongly convex functions whose Hessian is bounded below by $\mu$ and above
by $\Lambda$. The Hessian is invertible so that Newton’s method is well defined for every point in
E. Moreover a strongly convex function achieves a minimum, where $\nabla f(x) = 0$. However
Newton’s method may not converge to a root of $\nabla f(x) = 0$ from arbitrary points in E.
This is a raison d'être for the global Newton methods. These methods, whose ingredients
contain Newton steps, generate sequences that converge for every strongly convex function
and any starting point in E. The convergence rate is asymptotically superlinear. An
early history of this subject may be found in Polak(1973), who cites contributions of 
Goldstein(1965), Pshenichnyi(1970), and Robinson(1972). More recent work is due to 
Bertsekas(1982), Dunn(1980), Hughes and Dunn(1984), and others. All of these results 
give estimated asymptotic rates of convergence. Global Newton methods for finding roots 
go back to at least 1934. They are related to continuation methods. An early history 
and discussion may be found in Ortega and Rheinboldt(1970,p235), who credit the basic 
idea to Lahaye(1934,1948). Current references may be found in Smale(1986-2). In general
we regard a global Newton method as any algorithm incorporating Newton steps that
generates a finite sequence terminating in an approximate root. This is a point from
which the ordinary Newton’s method will converge. Other algorithms are available that 
terminate in an approximate root. The efficiency or iteration count of 2 such algorithms 
will be compared to a global Newton method. The word “algorithm” as used in this paper 
should be taken with “a grain of salt”. We assume information that is not given with real 

* supported by grants NIH RR01243-05 AND NPS LMC-M4E1 
problems. Our excuse for doing this is that we hope thereby to gain insight and motivation
for the future construction of good algorithms. 

The efficiency of a Global Newton method was probably first analyzed by Kung(1976), 
using natural assumptions that imply a non-vanishing Jacobian. It appears that the next 
such result is due to Smale(1986-2) who established a global Newton method in the general 
setting of an analytic mapping between Banach spaces, both real or both complex. We 
revisit this problem below. Our assumptions are close to Kung’s but our algorithm follows 
Smale. The first part of this paper will show that the class of strongly convex functions 
and thus any more generalized classes that include the strongly convex functions are not a 
suitable setting for Newton’s method; hence, also not suitable for global Newton methods. 
Unfortunately, this is the setting for the asymptotic convergence proofs mentioned above. 

Consider the class $C^1(\mu, \Lambda)$ having the following definition. Let F be a continuously
differentiable map from a separable real Hilbert space H into itself. The inner product in H
will be denoted by [ , ]. Let D(x) denote the Fréchet derivative of F at x. By $C^1(\mu, \Lambda)$
we denote the set of all maps F for which $\mu\|h\| \le \|D(x)h\| \le \Lambda\|h\|$ for all $h, x \in H$,
with $\mu > 0$. Let $Q = \Lambda/\mu$. Assume that the linear operator D(x) has an inverse. We
shall show that no global Newton method (or any other algorithm) can do better than
linear convergence at a certain determined rate over every member of the above class. Any
algorithm that can achieve this rate is called an optimal algorithm. For the special case
of $C(\mu, \Lambda)$ a simple algorithm due to Nesterov (1983) is optimal to within a multiplicative
constant. The convergence rate is linear. Nesterov’s algorithm does not require inversions;
it is similar to the gradient method. Any application of Newton’s method requires the
computation of an inverse operator or the solution of a system of linear equations. If the
dimension of H is small we usually are willing to pay the price of solving equations to gain
the possibility of quadratic convergence. The convergence estimates for the global Newton
method in the space $C^1(\mu, \Lambda, L)$, the subset of $C^1(\mu, \Lambda)$ with D(x) satisfying a uniform
Lipschitz condition with constant L, show arbitrarily slow convergence for sufficiently large values of L.
For the special case when D(x) is everywhere self-adjoint we exhibit a gradient algorithm 
whose efficiency is insensitive to L . For this case the estimate for the gradient method is 
superior to that of the global Newton method when L is sufficiently large. However the 
gradient method is sensitive to Q, while the global Newton method is not. Thus for fixed 
L and large enough values of Q the situation is reversed. 

It is a pleasure to thank Brad Bell for discussions and helpful criticisms. 
REMARK 1. a). If $F \in C^1(\mu, \Lambda)$ then any stationary point of $\|F(x)\|^2$ is a root of F.

PROOF Let $f(x) = [F(x), F(x)]$. The differential $f'(x,h) = 2[F(x), D(x)h]$, where $D(x)h = F'(x,h)$.
Let $h = D^{-1}(x)F(x)$. If x is stationary then $f'(x,h) = 0 = 2[F(x), F(x)]$.
Whence F(x) = 0.

b). EXISTENCE. In view of a) above, if in addition f has compact level sets then F has
roots.

Let $C(\mu, \Lambda)$ denote the set of twice differentiable convex functions with

$$\mu\|h\|^2 \le f''(x,h,h) \le \Lambda\|h\|^2$$

for all x and $h \in H$, and some positive $\mu \le \Lambda$. The class $C(\mu, \Lambda)$ is called a set of “strongly
convex” functions. The number $Q = \Lambda/\mu$ is called the condition number.

ALGORITHMS By an algorithm A(g) where $g \in C(\mu, \Lambda)$ we mean a recurrence relation
that calculates $x_{k+1}$ using some of the values of g, g' and g'' at $x_s$, s = 0,1,2,...,k, with $x_0$
arbitrarily given. A(g) is a special case of a “local method” defined by Nemerovsky and
Yuden, 1981. By an algorithm B(F) defined on $C^1(\mu, \Lambda)$, we mean a recurrence relation
that calculates $x_{k+1}$ using some of the values of F and F' at $x_s$, $0 \le s \le k$, with $x_0$
arbitrarily given. B(F) is also an instance of a local method. We shall assume that all
global Newton methods are B(F) algorithms.

Let $C^s(\mu, \Lambda)$ denote the subset of $C^1(\mu, \Lambda)$ for which D(x) is self-adjoint with spectral bounds
$\mu$ and $\Lambda$ for all $x \in H$. For $F \in C^s(\mu, \Lambda)$ we can associate a “potential” function f (Vainberg
1955) such that $\nabla f(x) = F(x)$ for all x in H. (Actually, an equivalence class of functions,
differing from each other by a constant.) Also to every $f \in C(\mu, \Lambda)$ there corresponds an
$F \in C^s(\mu, \Lambda)$. The function f is weakly lower semi-continuous and the level sets of f are
weakly sequentially compact. This, and the strong convexity of f, imply that there exists
a unique minimizer for f, say z.



When any formula below is followed by the word “steps” we mean that the formula is to 
be rounded up to the nearest integer. We rely on the following claim. 



THEOREM 1 (NEMEROVSKY-YUDEN, 1979) Given a positive $\varepsilon < 1$, a fixed but arbitrary
point $x_0 \in H$ and an algorithm A(f), there exists a function $f \in C(\mu, \Lambda)$ such that
if $x_k$ generated by A(f) reduces $f(x_k) - f(z)$ to less than $\varepsilon(f(x_0) - f(z))$ then k exceeds
the number:

$$c\,\frac{\min(n, \sqrt{Q})}{\ln \min(n, \sqrt{Q})}\,\ln\frac{1}{\varepsilon} = R\ln\frac{1}{\varepsilon} \quad \text{steps}$$

Here c is a positive constant, Q > 2, and z = argmin f.

REMARK 2. For n and Q sufficiently large and $\varepsilon$ sufficiently small the above bound may
be increased to:

$$c\sqrt{Q}\,\ln\frac{1}{\varepsilon}$$

Given a positive $\varepsilon < 1$ and any function $f \in C(\mu, \Lambda)$ there is an algorithm A that yields
$f(x_k) - f(z) \le (f(x_0) - f(z))\varepsilon$ whenever k exceeds $4\sqrt{Q}\,(\ln 2)^{-1}\ln\varepsilon^{-1} = R'\ln\varepsilon^{-1}$. This
algorithm is due to Nesterov(1983). It is essentially optimal, and can only be improved by
a decrease in the constant factor 4/ln 2. Stated otherwise, the algorithm A applied to any
function $f \in C(\mu, \Lambda)$ generates a sequence $x_k$ that satisfies $(f(x_k) - f(z))/(f(x_0) - f(z)) \le (e^{-1/R'})^k$
for $1 \le k < \infty$.
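
To get a feel for the sizes involved, the two iteration counts above can be evaluated numerically. The following is only an illustrative sketch: the function names are invented here, and the constant c of THEOREM 1 is not specified in the text, so it appears as a caller-supplied parameter.

```python
import math

def lower_bound_steps(Q, n, eps, c):
    """Evaluate c * [m / ln m] * ln(1/eps) with m = min(n, sqrt(Q)), rounded up.

    c stands for the unspecified positive constant of THEOREM 1 (an assumption
    supplied by the caller). The formula needs m > 1, i.e. Q > 2 and n >= 2.
    """
    m = min(n, math.sqrt(Q))
    return math.ceil(c * m / math.log(m) * math.log(1.0 / eps))

def nesterov_steps(Q, eps):
    """Evaluate R' ln(1/eps) with R' = 4 sqrt(Q) / ln 2, rounded up."""
    r_prime = 4.0 * math.sqrt(Q) / math.log(2.0)
    return math.ceil(r_prime * math.log(1.0 / eps))
```

Both counts grow like the square root of Q when n is large, so quadrupling Q roughly doubles `nesterov_steps`.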

The algorithm A (that will be called GRAD1 below) may also be taken to be the gradient
method with step length $1/\Lambda$. This algorithm requires no information about the values
of the function f, while Nesterov’s does. Observe that the linearly converging sequence
$(e^{-1/R})^k$, k = 1,2,3,... is for each k a lower bound for the relative decrease of some
function f in $C(\mu, \Lambda)$ at $x_k$, while the sequence $(e^{-1/R'})^k$ is an upper bound for the
relative decrease for any function in $C(\mu, \Lambda)$. For the gradient method above the sequence
is $(e^{-1/Q})^k$.

This prompts us to call the class $C(\mu, \Lambda)$ “esslinearly convergent”, that is every function in
the class can be made to converge no slower than linearly, but
$\sup\{(f(x_k) - f(z))/(f(x_0) - f(z)) : f \in C(\mu, \Lambda)\}$, k = 1,2,3,..., cannot converge faster than linearly. For brevity we shall
refer to this latter property as “sublinearly convergent”. We now observe that the class
$C^s(\mu, \Lambda)$ is also esslinearly convergent.

LEMMA 1 Given $F \in C^s(\mu, \Lambda)$, let f denote any potential function for F. Let z = argmin
f. The following inequalities obtain:

$$(2Q)^{-1}\left[\frac{f(x) - f(z)}{f(x_0) - f(z)}\right]^{1/2} \le \frac{\|F(x)\|}{\|F(x_0)\|} \le 2Q\left[\frac{f(x) - f(z)}{f(x_0) - f(z)}\right]^{1/2} \qquad (A)$$

Moreover, if

$$f(x) - f(z) \le \varepsilon^2(f(x_0) - f(z))/4Q^3 \quad \text{then} \quad \|F(x)\| \le \varepsilon\|F(x_0)\| \qquad (B)$$
PROOF By the strong convexity of f and Taylor’s theorem we get:

$$\frac{\mu}{2}\|x - z\|^2 \le f(x) - f(z) \le \frac{\Lambda}{2}\|x - z\|^2 \qquad (a)$$

By the generalized mean value theorem and the convexity of f we get:

$$\|F(x)\| \le \Lambda\|x - z\| \qquad (b)$$

and

$$f(x) - f(z) \le \|F(x)\|\,\|x - z\| \qquad (c)$$

By (a) and (b) we have:

$$\frac{\|F(x)\|}{\|F(x_0)\|} \le \Lambda\left[\frac{2(f(x) - f(z))}{\mu\|F(x_0)\|^2}\right]^{1/2} \qquad (d)$$

By (a) and (c) we get

$$\|F(x)\| \ge \frac{\mu}{2}\|x - z\| \qquad (e)$$

and

$$\|F(x)\| \ge \sqrt{\frac{\mu}{2}}\,(f(x) - f(z))^{1/2} \qquad (f)$$

To prove (B) we find, using the hypothesis of (B) together with (d), that

$$\frac{\|F(x)\|}{\|F(x_0)\|} \le \Lambda\left[\frac{2\varepsilon^2(f(x_0) - f(z))}{4\mu Q^3\|F(x_0)\|^2}\right]^{1/2}$$

Using (e) we find that the right hand side is less than or equal to

$$\Lambda\left[\frac{2\varepsilon^2(f(x_0) - f(z))}{4\mu Q^3\mu^2\|x_0 - z\|^2/4}\right]^{1/2}$$

Now using (a) the above expression is less than or equal to $\varepsilon$.

We now turn to the proof of (A). Using (f) and (d) we find that

$$\sqrt{\frac{2}{\mu}}\,\frac{\|F(x)\|}{\|F(x_0)\|} \ge \frac{(f(x) - f(z))^{1/2}}{\|F(x_0)\|} \ge \left(\frac{f(x) - f(z)}{f(x_0) - f(z)}\right)^{1/2}\frac{\sqrt{\mu}}{\sqrt{2}\,\Lambda}$$

This proves the left side of (A). The right hand inequality is proved similarly, using (d)
and (f).
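
As a sanity check on inequality (A), the two sides can be compared numerically on a simple member of the class. The sketch below assumes a two-dimensional quadratic potential with Hessian diag(μ, Λ) and minimizer z = 0; the constants, the grid of test points and all names are choices made here for illustration, not part of the original argument.

```python
import math

MU, LAM = 1.0, 10.0          # assumed spectral bounds
Q = LAM / MU                 # condition number

def f(x):
    """Quadratic potential with Hessian diag(MU, LAM); its minimizer is z = 0."""
    return 0.5 * (MU * x[0] ** 2 + LAM * x[1] ** 2)

def F(x):
    """Gradient map F = grad f."""
    return (MU * x[0], LAM * x[1])

def norm(v):
    return math.hypot(v[0], v[1])

def lemma1_A_holds(x, x0):
    """Check (2Q)^-1 r^(1/2) <= ||F(x)|| / ||F(x0)|| <= 2Q r^(1/2)."""
    r = f(x) / f(x0)                      # f(z) = 0 for this example
    ratio = norm(F(x)) / norm(F(x0))
    return math.sqrt(r) / (2.0 * Q) <= ratio <= 2.0 * Q * math.sqrt(r)

x0 = (3.0, -2.0)
points = [(a / 4.0, b / 4.0) for a in range(-8, 9) for b in range(-8, 9)
          if (a, b) != (0, 0)]
all_ok = all(lemma1_A_holds(x, x0) for x in points)
```

For this particular quadratic the ratio stays well inside the band, which suggests that the constants 2Q and (2Q)^{-1} are generous rather than tight here.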



LEMMA 2 The class $C^s(\mu, \Lambda)$ is esslinear.

PROOF Let f be a potential function corresponding to F. Every algorithm B(F) is now
also an algorithm A(f). Every function $F \in C^s(\mu, \Lambda)$ is the gradient of some $f \in C(\mu, \Lambda)$.
Hence, by LEMMA 1, for some F, $\|F(x_k)\|/\|F(x_0)\|$ converges more slowly than $(e^{-1/R})^{k/2}/2Q$. Now
take for the algorithm B the gradient algorithm mentioned above. Again by LEMMA 1
every function F will converge under B with at least a linear rate.

Since $C^s(\mu, \Lambda)$ is a subset of $C^1(\mu, \Lambda)$, then for some $F \in C^1(\mu, \Lambda)$, $\|F(x_k)\|/\|F(x_0)\|$
converges more slowly than $(e^{-1/R})^{k/2}/2Q$. Now for B(F) we take the algorithm GRAD2
below. This algorithm converges linearly. Whence we have

THEOREM 2. The class $C^1(\mu, \Lambda)$ is sublinearly convergent.

We now restrict the class $C^1(\mu, \Lambda)$ to enlarge the possibility of faster convergence. Let
$C^1(\mu, \Lambda, L)$ denote the set of maps $F \in C^1(\mu, \Lambda)$ for which $\|D(x) - D(y)\| \le L\|x - y\|$ for all $x, y \in H$.
The following well-known theorem is adjusted for our present setting.

THEOREM 3 KANTOROVICH(1948) Take $x_0 \in H$. Let $\beta(x_0) = \|D^{-1}(x_0)\|$ and
$\eta(x_0) = \|D^{-1}(x_0)F(x_0)\|$. Assume that $\|D(x) - D(y)\| \le \lambda(x_0)\|x - y\|$ for all pairs x, y
in the ball $B(x_0) = \{x \in H : \|x - x_0\| \le 2\eta(x_0)\}$. If $\eta(x_0)\beta(x_0)\lambda(x_0) = h(x_0) \le 1/2$,
then F has a root z such that z is in the ball $B(x_0)$, the Newtonian iterates $x_j$ defined by
$x_{j+1} = x_j - D^{-1}(x_j)F(x_j)$ lie in $B(x_0)$, and $\|x_j - z\| \le 2^{1-j}(2h(x_0))^{2^j - 1}\eta(x_0)$.

A convenient terminology similar to Smale’s is that under the above circumstances $x_0$ is
an “approximate root”.

In what follows we shall take $h(x_0) = 1/4$.

REMARK 3. The condition for an approximate root, $\eta(x_0)\beta(x_0)\lambda(x_0) \le 1/4$, is
equivalent to:

$$\|F(x_0)\| \le \alpha(x_0) = \frac{1}{4\beta(x_0)\lambda(x_0)\,\bigl\|D^{-1}(x_0)F(x_0)/\|F(x_0)\|\bigr\|}$$



REMARK 4. We have for all $x \in H$ global estimates for $\eta(x)$, $\beta(x)$, and $\lambda(x)$. Namely:
$\beta(x) \le 1/\mu$, $\eta(x) \le \|F(x)\|/\mu$ and $\lambda(x) \le L$. From these estimates we get:

$$\alpha(x) \ge \mu^2/4L = \alpha$$

If $\|F(x)\| \le \alpha$ then x is an approximate root, and
$\eta(x) \le \mu/4L$.
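
The threshold α and the Kantorovich error bound can be illustrated on a one-dimensional example. The sketch below assumes H = R and the map F(x) = 2x + sin x, for which μ = 1, Λ = 3, L = 1 and the root is z = 0; this example, the starting point and the helper names are assumptions made here for illustration only, and L is used in place of the local Lipschitz constant.

```python
import math

# Assumed scalar example (H = R): F(x) = 2x + sin x, with root z = 0.
# D(x) = 2 + cos x lies in [1, 3] and |D(x) - D(y)| <= |x - y|.
MU, LAM, L = 1.0, 3.0, 1.0

def F(x):
    return 2.0 * x + math.sin(x)

def D(x):
    return 2.0 + math.cos(x)

ALPHA = MU ** 2 / (4.0 * L)   # global threshold alpha = mu^2 / 4L

def is_approximate_root(x):
    """Sufficient test: ||F(x)|| <= alpha."""
    return abs(F(x)) <= ALPHA

def newton_errors(x0, steps):
    """Return |x_j - z| for j = 0..steps, with z = 0 for this example."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - F(xs[-1]) / D(xs[-1]))
    return [abs(x) for x in xs]

def kantorovich_bound(x0, j):
    """2^(1-j) (2h)^(2^j - 1) eta, with L standing in for the local constant."""
    beta = 1.0 / abs(D(x0))
    eta = abs(F(x0)) * beta
    h = eta * beta * L
    return 2.0 ** (1 - j) * (2.0 * h) ** (2 ** j - 1) * eta

x0 = 0.08
errors = newton_errors(x0, 3)
bounds = [kantorovich_bound(x0, j) for j in range(4)]
```

Only the first few iterates are compared, since beyond that the bound falls below floating-point resolution.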
In many problems $\alpha$ is so small that the desired accuracy tolerance is achieved before an
approximate root is achieved. Thus the efficiency of a global Newton method in reaching
an approximate root is a crucial question. We now turn to our version of Smale’s algorithm
which we shall denote by “SGN”. In what follows $\alpha(x_i)$ will be denoted by $\alpha_i$.

REMARK 5. In what follows the constants $\mu$, $\Lambda$, and L need not be finite over the entire
space H, but rather on the set $S = \{x \in H : \|F(x)\| \le \|F(x_0)\|\}$.



LEMMA 3. Assume $F \in C^1(\mu, \Lambda, L)$ and $x_0$ is arbitrarily given in H. If $\|F(x_0)\| \le \alpha_0$ then
$x_0$ is an approximate root. If not we define a sequence $x_0, t_1, x_1, t_2, \ldots$ inductively as
follows. Given $x_i$ set

$$t_{i+1} = \frac{\|F(x_i)\| - \alpha_i}{\|F(x_i)\|} \qquad (a)$$

Choose $x_{i+1}$ to satisfy

$$\|F(x_{i+1}) - t_{i+1}F(x_i)\| \le \alpha_i/2 \qquad (b)$$

Then

$$\|F(x_{i+1})\| \le \|F(x_0)\| - (i+1)\alpha/2$$

where $\alpha$ is defined as in Remark 4.

PROOF. We show first that $x_{i+1}$ can be chosen to satisfy (b). Let $G_i(x) = F(x) - t_{i+1}F(x_i)$.
Since $\|G_i(x_i)\| = \alpha_i$, $x_i$ is an approximate root for $G_i$, because $F' = G_i'$. A few
Newton steps (we count them below) suffice to obtain $x_{i+1}$ such that $\|G_i(x_{i+1})\| \le \alpha_i/2$.
Thus (b) can be satisfied. Using the triangle inequality on (b) together with (a) we
get that $\|F(x_{i+1})\| \le \|F(x_i)\| - \alpha_i/2$. Whence $\|F(x_{i+1})\| \le \|F(x_0)\| - \frac{1}{2}\sum_{j=0}^{i}\alpha_j \le \|F(x_0)\| - (i+1)\alpha/2$.
Now choose i so that $\|F(x_{i+1})\| \le \alpha$.

CLAIM 1 Let N be the least integer exceeding $2(\|F(x_0)\| - \alpha)/\alpha$. Then for some $i \le N$, $x_i$
is an approximate root of F.

We now estimate the number of Newton steps to move from $x_i$ to $x_{i+1}$.

LEMMA 4 Let $\{y_{ij}\}$ be a sequence of Newtonian iterates starting at $y_{i0} = x_i$. Let $G_i(z_i) = 0$.
Then we can choose $x_{i+1} = y_{iK}$ where K is the least integer $\ge 1.443\ln(1.443\ln 8Q)$.

PROOF We have seen that $x_i$ is an approximate zero for $G_i$, hence by THEOREM 3 with $h = 1/4$ and $\eta \le \mu/4L$,
$\|y_{ij} - z_i\| \le \mu L^{-1}(1/2)^{2^j}$.

Then

$$\|G_i(y_{ij}) - G_i(z_i)\| \le \Lambda\|y_{ij} - z_i\| \le \Lambda\mu L^{-1}(1/2)^{2^j}.$$

Now choose K so that $\Lambda\mu L^{-1}(1/2)^{2^K} \le \alpha/2$, that is $(1/2)^{2^K} \le 1/(8Q)$.
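
The recursion of LEMMA 3 together with the inner Newton count of LEMMA 4 can be sketched in a few lines for a scalar map. This is a minimal illustration under stated assumptions, not the author's implementation: it works in H = R, uses the global threshold α = μ²/4L in place of α_i, always takes exactly K inner Newton steps, and the test map F(x) = 2x + sin x (μ = 1, Λ = 3, L = 1) and all names are chosen here.

```python
import math

def sgn(F, D, x0, mu, lam, L, max_outer=100000):
    """Sketch of the SGN recursion for a scalar map F with derivative D.

    Assumptions of this sketch: the global alpha = mu^2 / (4L) replaces
    alpha_i, and each inner solve takes exactly K Newton steps, with
    K = ceil(1.443 ln(1.443 ln 8Q)). Returns (x, outer_steps, K).
    """
    alpha = mu ** 2 / (4.0 * L)
    Q = lam / mu
    K = math.ceil(1.443 * math.log(1.443 * math.log(8.0 * Q)))
    x, outer = x0, 0
    while abs(F(x)) > alpha and outer < max_outer:
        Fx = F(x)
        t = 1.0 - alpha / abs(Fx)         # step (a): shrink ||F|| by alpha
        target = t * Fx                   # G(y) = F(y) - t F(x)
        y = x
        for _ in range(K):                # Newton steps on G; note G' = F'
            y -= (F(y) - target) / D(y)
        x, outer = y, outer + 1
    return x, outer, K

# Assumed example: F(x) = 2x + sin x, root z = 0.
def F(x):
    return 2.0 * x + math.sin(x)

def D(x):
    return 2.0 + math.cos(x)

MU, LAM, LIP = 1.0, 3.0, 1.0
ALPHA = MU ** 2 / (4.0 * LIP)
x_approx, outer_steps, K = sgn(F, D, 10.0, MU, LAM, LIP)
N_bound = math.ceil(2.0 * (abs(F(10.0)) - ALPHA) / ALPHA)

# Ordinary Newton iteration started from the approximate root.
x_root = x_approx
for _ in range(6):
    x_root -= F(x_root) / D(x_root)
```

In this example the outer loop needs fewer steps than the bound N of CLAIM 1, because each inner solve is far more accurate than the tolerance α/2 requires.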
REMARK 6. The above algorithm can be optimized by changing the right hand side of
inequality (b) in the recursion above to $\alpha/q$ with q > 1. The formula for N becomes
$(\|F(x_0)\| - \alpha)q/\alpha$ and K becomes $1.443\ln(1.443\ln 4qQ)$. Now choose q to minimize NK.



LEMMA 5 Take $F \in C^1(\mu, \Lambda, L)$. Assume that D(x) is self-adjoint for all $x \in H$. The
gradient method mentioned below REMARK 2, called Algorithm A, that we
shall now call “GRAD1” will, starting at $x_0$, generate an approximate root in K steps,
where K is the smallest integer $\ge Q\ln[\|F(x_0)\|4QL/\mu^2]$.

PROOF The mapping G defined by $G(y) = y - F(y)/\Lambda$ has a fixed point z satisfying F(z)
= 0. It is a contractor satisfying a Lipschitz condition with constant $q = 1 - 1/Q$; see Goldstein(1967, pp.
15 and 24). Set $G(x_n) = x_{n+1} = x_n - F(x_n)/\Lambda$. Then $\|F(x_n) - F(z)\| \le \Lambda\|x_n - z\| \le \Lambda q^n\|x_0 - x_1\|/(1 - q) = \|F(x_0)\|q^n/(1 - q)$.
Now choose n so that

$$\|F(x_0)\|q^n \le \alpha(1 - q)$$

Using the inequality $-\ln(1 - \mu/\Lambda) \ge \mu/\Lambda$, one obtains the lemma.
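
The fixed-point iteration in this proof is short enough to try directly. The sketch below runs it on an assumed scalar example, F(x) = 2x + sin x with μ = 1, Λ = 3, L = 1 (trivially self-adjoint), and compares the observed step count with the bound of the lemma; the example and the function names are illustrative choices made here.

```python
import math

def grad1(F, x0, mu, lam, L, max_iter=100000):
    """Iterate x <- x - F(x)/lam until ||F(x)|| <= alpha = mu^2 / (4L).

    Scalar sketch of the GRAD1 iteration; returns (x, number of steps).
    """
    alpha = mu ** 2 / (4.0 * L)
    x, steps = x0, 0
    while abs(F(x)) > alpha and steps < max_iter:
        x -= F(x) / lam
        steps += 1
    return x, steps

def grad1_bound(F0_norm, mu, lam, L):
    """Step bound Q ln(||F(x0)|| 4 Q L / mu^2), rounded up."""
    Q = lam / mu
    return math.ceil(Q * math.log(F0_norm * 4.0 * Q * L / mu ** 2))

# Assumed example: F(x) = 2x + sin x.
def F(x):
    return 2.0 * x + math.sin(x)

MU, LAM, LIP = 1.0, 3.0, 1.0
x_end, steps = grad1(F, 10.0, MU, LAM, LIP)
bound = grad1_bound(abs(F(10.0)), MU, LAM, LIP)
```

On this well-conditioned example the iteration stops well before the bound, which is a worst-case estimate over the whole class.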

We now consider a gradient method for the non-symmetric case. We call this gradient 
method GRAD2. 

ALGORITHM GRAD2 Take $F \in C^1(\mu, \Lambda, L)$, $x_0 \in S$ and set $f(x) = \|F(x)\|^2$. Then $\nabla f(x)$
satisfies $\|\nabla f(x) - \nabla f(y)\| \le M\|x - y\|$ for all x and y in S, with $M = 2(\Lambda^2 + \|F(x_0)\|L)$.
Set $\varphi(x) = f(x)\nabla f(x)/\|\nabla f(x)\|^2$. Given arbitrary $x_0$ in H set $x_{k+1} = x_k - \gamma_0\varphi(x_k)$ with
$\gamma_0 = 2\mu^2/M$. If k exceeds

$$2\left(\frac{\|F(x_0)\|L}{\mu^2} + Q^2\right)\ln\frac{\|F(x_0)\|4L}{\mu^2}$$

then $x_k$ is an approximate root.

PROOF Adding and subtracting $(F'(y))^*F(x)$ we find that $\|\nabla f(x) - \nabla f(y)\| \le 2(\Lambda^2 + \|F(x)\|L)\|x - y\|$,
and $f(x) - f(x - \gamma\varphi(x)) = \gamma[\nabla f(x), \varphi(x)] + \gamma[\nabla f(\xi) - \nabla f(x), \varphi(x)]$.
Here $\|\xi - x\| \le \gamma\|\varphi(x)\|$. Then $f(x - \gamma\varphi(x)) \le f(x) - \gamma f(x) + \gamma^2M\|\varphi(x)\|^2$, $\nabla f(x) = 2(F'(x))^*F(x)$,
and $\|\nabla f(x)\| \ge 2\|F(x)\|\mu$. Then $f(x_{k+1}) \le f(x_k)[1 - \gamma_0 + \gamma_0^2M/4\mu^2] = f(x_k)(1 - \mu^2/M)$.
Taking square roots we get that $\|F(x_{k+1})\| \le \|F(x_k)\|(1 - \mu^2/2M)$.
Finally, choose k so that $\|F(x_0)\|(1 - \mu^2/2M)^k \le \mu^2/4L$.
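
A small finite-dimensional sketch may help make the GRAD2 iteration concrete. It assumes H = R², a test map with a non-symmetric Jacobian whose singular values lie in [1.3, 1.52] and whose Jacobian is Lipschitz with constant 0.1; the map, the constants and the names are all invented here. The step length is passed in explicitly, and the example uses 2μ²/M, the value that minimizes 1 − γ + γ²M/(4μ²), which is an assumption of this sketch.

```python
import math

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def grad2(F, J, x0, mu, lam, L, gamma0, max_iter=100000):
    """Iterate x <- x - gamma0 * phi(x), phi = f grad f / ||grad f||^2, f = ||F||^2.

    Stops once ||F(x)|| <= alpha = mu^2 / (4L). Returns (x, history of ||F||).
    """
    alpha = mu ** 2 / (4.0 * L)
    x = list(x0)
    history = [norm(F(x))]
    while history[-1] > alpha and len(history) <= max_iter:
        Fx, Jx = F(x), J(x)
        f = sum(c * c for c in Fx)
        # grad f = 2 J(x)^T F(x)
        grad = [2.0 * sum(Jx[i][j] * Fx[i] for i in range(len(Fx)))
                for j in range(len(x))]
        g2 = sum(c * c for c in grad)
        x = [x[j] - gamma0 * f * grad[j] / g2 for j in range(len(x))]
        history.append(norm(F(x)))
    return x, history

# Assumed example with a non-symmetric Jacobian.
def F(x):
    return [x[0] + x[1] + 0.1 * math.sin(x[0]), -x[0] + x[1]]

def J(x):
    return [[1.0 + 0.1 * math.cos(x[0]), 1.0], [-1.0, 1.0]]

MU, LAM, LIP = 1.3, 1.52, 0.1
x0 = [30.0, 40.0]
M = 2.0 * (LAM ** 2 + norm(F(x0)) * LIP)
x_end, history = grad2(F, J, x0, MU, LAM, LIP, gamma0=2.0 * MU ** 2 / M)
```

Each step costs one Jacobian-transpose product and no linear solve, which is the trade-off against a Newton step discussed in the comparison of the algorithms.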

Comparing the algorithms. Let

$$S = \frac{\|F(x_0)\|L}{\mu^2}$$

By Claim 1 and Lemma 4 the total number of steps of SGN is:

SGN : $11.544\,(S - .25)\ln(1.443\ln 8Q)$

GRAD2 : $2(S + Q^2)\ln 4S$
GRAD1 : $Q\ln(4SQ)$

Notice that unlike GRAD1 and GRAD2, SGN is insensitive to the condition number Q!
However SGN is sensitive to S. GRAD1 is sensitive to Q but not to S. GRAD2 is sensitive
to both of these factors. In the symmetric case for fixed Q, GRAD1 is quicker than SGN
when $\|F(x_0)\|$ grows sufficiently large or if $L/\mu^2$ gets sufficiently large. On the other
hand for fixed S, SGN is quicker as Q grows sufficiently large. In the non-symmetric case
SGN is superior to GRAD2 with respect to the number of steps. When the cost per
step is included, the gradient methods become cheaper in the n-dimensional case when n
is sufficiently large. For each Newton step an n×n system of linear equations is solved,
costing $O(n^3)$ multiplications, while the corresponding GRAD2 step involves the
multiplication of an n×n matrix by an n×1 matrix, or $n^2$ multiplications, and GRAD1 requires no
matrix operations.
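
The three step estimates are easy to tabulate for given S and Q. The helper functions below simply evaluate the stated formulas and round up; the function names are invented for this sketch, and the formulas are meaningful only for S > 0.25.

```python
import math

def sgn_steps(S, Q):
    """SGN estimate: 11.544 (S - 0.25) ln(1.443 ln 8Q), rounded up."""
    return math.ceil(11.544 * (S - 0.25) * math.log(1.443 * math.log(8.0 * Q)))

def grad1_steps(S, Q):
    """GRAD1 estimate: Q ln(4 S Q), rounded up."""
    return math.ceil(Q * math.log(4.0 * S * Q))

def grad2_steps(S, Q):
    """GRAD2 estimate: 2 (S + Q^2) ln(4 S), rounded up."""
    return math.ceil(2.0 * (S + Q ** 2) * math.log(4.0 * S))
```

With S = 10 and Q = 10 the ordering is GRAD1, SGN, GRAD2 from fewest to most steps, while raising Q to 10^4 at the same S puts SGN ahead of GRAD1, in line with the discussion above.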